HaploQA is a web application developed by Kieth Sheppard for The Reinholdt’s Lab in 2015 for the purpose of storing, increasing accessibility and quality controlling of micro array mice genotyping data. According to the HaploQA about website, ‘On the import of a sample, the haplotype probabilities that are represented in the plots are calculated.’ and the haplotype reconstruction process for the contributing founder strains is then done based on the calculated probabilities. The source codes of HaploQA was written in Python 3 and JavaScript.
R/qtl2 is an open-source software package built in 2019, which not only performs similar haplotype reconstruction tasks as HaploQA but also has the functionality to perform other computations such as genetic mapping - in the pipeline developed for this project,the software takes genotyping data as input, calculate the haplotype probabilities based on a Hidden Markov Model algorithm, then perform haplotype reconstructions based on the quantitative results.
Since R/qtl2 was built more recently and is currently maintained by the developers, the Jackson Laboratory is considering transitioning from HaploQA to R/qtl2 generalized AIL (Advanced intercross lines) for similar haplotype reconstructing tasks with genotyping data. The main purpose of this project is to decide if the transition should be made determine based on whether HaploQA or R/qtl2 genail provides more accurate haplotype reconstruction results.
The best way to determine which method provides more accurate results is to compare the results of the two methods each to the ‘correct’ results - in R/qtl2, there exists some cross-specific models which the Hidden Markov model was tailored to generate the most accurate results for a particular cross. These models will be referred to as the ‘truth’ models in this document.
In this project, we selected some GigaMUGA crosses, developed pipelines that retrieve genotyping data used in the initial HaploQA algorithm, performs haplotype reconstruction computations for both qtl2 genail and haploqa, and compare the percentages of markers within each method that agree with the optimal model of the according crosses to determine whether HaploQA or R/qtl2 gen-AIL would be the most appropriate for similar haplotype reconstruction tasks. The analysis was also done for MiniMUGA, as it is the standard cross type JAX uses for mutant mice, however, there is no tailored truth model for MiniMUGA, so the amount of marker differences between HaploQA and qtl2 was used as the main metrics for MiniMUGA to determine the model performances.
The main repository is https://github.com/TheJacksonLaboratory/HaploQA_qtl2_comp which contains the main pipelines developed for this project, all utility functions, and visualizations of selected results.
There is no need to pre-download anything from any other websites or manually create any directories, as those processes were automated by the pipelines. However, the automation depend heavily on the CSV files stored in the directory, and without proper set-ups of the CSV files, the pipeline will not run. Please make sure the following CSVs are present in the directory after cloning the repo:
annotations_config.csv - configuration file containing all technical set-ups for the pipeline, details will be explained in the next section.
founder_lookup_table.csv - lookup table containing all possible geno codes (AA, BB, AB, AY, BY, etc.) and their according number codes (1, 2, 3, 4, etc.)
founder_strains_table.csv - lookup table containing the standard 8 founder strains (A/J, C57BL/6J, etc.) and their according letter codes (A, B, etc.) This is primarily used for CC and DO to ensure the 8 founders are encoded in the correct order.
The pipelines developed for this project is stored in the script, ‘haplotype_reconstruction_pipelines.R’, which is stored in the same working directory as this report and is sourced below:
library(rstudioapi)
library(qtl2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ stringr 1.4.0
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::read_csv() masks qtl2::read_csv()
library(httr)
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(data.table)
##
## Attaching package: 'data.table'
##
## The following object is masked from 'package:purrr':
##
## transpose
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(qtl2convert)
library(ggrepel)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following objects are masked from 'package:data.table':
##
## dcast, melt
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(stats)
library(ggplot2)
library(stringr)
library(stringi)
library(lsa)
## Loading required package: SnowballC
library(readr)
root <- dirname(getSourceEditorContext()$path)
#source(paste0(root,"/input_data_prep_functions.R"))
source(paste0(root,"/haplotype_reconstruction_pipelines.R"))
results_dir <- file.path(root, 'results')
shiny_pct_fp <- file.path(results_dir, 'shiny_pct_csvs')
# simulated phenotypes
list_pheno <- c('WBC', 'NEUT')
There are three major pipelines in the repository:
‘haplotype_reconstruction_pipeline’ - haplotype reconstructions for an entire cross (all individuals within the cross)
‘sample_haplotype_reconstruction’ - haplotype reconstructions for one individual
‘truth_model_reconstruction’ - haplotype reconstructions for optimal/truth models
Some crosses, such as DO and CC, have the same contributing founders across all individuals. The haplotype reconstructions can therefore be conducted for all individuals in the entire cross at the same time, and the pipeline ‘haplotype_reconstruction_pipeline’ can be used for crosses with consistent contributing founders for such purpose. Whereas the mutant crosses, such as MiniMUGA, contains individuals with mutate mice where all individuals may have different contributing founder strains. Haplotype reconstructions for these types of crosses need to conducted individually for each sample, and therefore, the pipeline ‘sample_haplotype_reconstruction’ should be used.
To run a cross, make sure to modify the configuration file, ‘annotations_config.csv’, that’s stored in the environment. This file allows the pipeline to automatically obtain all relevant information associated with each cross such as the sample URL, input data directory, output data directory, marker annotation file names, etc. The purpose of this configuration file is to avoid having to pass a large list of parameters into each pipeline function.
A preview of the configuration file is shown as below:
config <- fread(paste0(root, '/annotations_config.csv'))
print(config)
## array_type annot_file data_dir qtl2_dir
## 1: CC gm_uwisc_v2.csv haploqa_cc_gm cc_qtl2_genail
## 2: MiniMUGA mini_uwisc_v2.csv haploqa_minimuga minimuga_qtl2_genail
## 3: DO gm_uwisc_v2.csv haploqa_do do_qtl2_genail
## 4: BXD gm_uwisc_v2.csv haploqa_bxd_f3 bxd_f3_qtl2_genail
## 5: F2 gm_uwisc_v2.csv haploqa_f2 f2_qtl2_genail
## 6: MURGIGV01 gm_uwisc_v2.csv haploqa_gm_mutant gm_mutant_qtl2_genail
## marker_type
## 1: GigaMUGA
## 2: MiniMUGA
## 3: GigaMUGA
## 4: GigaMUGA
## 5: GigaMUGA
## 6: GigaMUGA
## url
## 1: http://haploqa-dev.jax.org/tag/Collaborative%20Cross.html
## 2: http://haploqa-dev.jax.org/tag/MiniMUGA.html
## 3: http://haploqa-dev.jax.org/tag/J:DO.html
## 4: http://haploqa-dev.jax.org/tag/BXD%20F3%20re-upload.html
## 5: http://haploqa-dev.jax.org/tag/f2.html
## 6: http://haploqa-dev.jax.org/tag/Jackson_Lab_Mahaffey_MURGIGV01_20160102_FinalReport.html
## founders_list
## 1: 129S1/SvImJ, A/J, C57BL/6J, CAST/EiJ, NOD/ShiLtJ, NZO/HlLtJ, PWK/PhJ, WSB/EiJ
## 2:
## 3: 129S1/SvImJ, A/J, C57BL/6J, CAST/EiJ, NOD/ShiLtJ, NZO/HlLtJ, PWK/PhJ, WSB/EiJ
## 4: C57BL/6J, DBA/2J
## 5: 129S1/SvImJ, C57BL/6J
## 6:
## ngen
## 1: 4
## 2: 4
## 3: 4
## 4: 3
## 5: 3
## 6: 4
## exclude_samples
## 1: AF8, AF9, AFA, AFB, AFD
## 2: DJN, DK5, DK9, DKA, DKK, DM6, DMD, DMK, DMQ, DN2, DNQ, DPG, DPS, DPX, DQ5, DRD, DS5, DSD, DW6, DWW, DWY
## 3:
## 4:
## 5:
## 6:
## founder_url
## 1: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 2: http://haploqa-dev.jax.org/tag/MiniMUGA_Reference.html
## 3: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 4: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 5: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## 6: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## truth_model
## 1: risib8
## 2:
## 3: do
## 4: ail
## 5: f2
## 6:
If the target cross exists in the config file, the pipeline will:
get the filename entered in the ‘annot_file’ column as the annotation file from ‘https://raw.githubusercontent.com/kbroman/MUGAarrays/main/UWisc/’
get the input data files for all individuals in the cross from the HaploQA url entered in the ‘url’ column
create a folder in the working directory (if not already exists) with the name entered in ‘data_dir’ column and outputs the scraped input files to this directory
create a folder in the working directory (if not already exists) with the name entered in ‘qtl2_dir’ column to output all the files for qtl2 input
get the founder data from the url given in ‘founder_url’ column
run qtl2 with the given value in the ‘ngen’ column
identify the appropriate crosstype to get the optimal model for the given cross using ‘truth_model’
identify the founders present in the cross. This should be left blank for the mutant crosses or any other crosses where the number of founders ia not consistent between individuals
use ‘marker_type’ to identify the conditions of certain if loops in the pipeline. Should be ‘GigaMUGA’ or ‘MiniMUGA’ for the current pipelines
If the cross that the user is running haplotype reconstructions for does not exist in the config file, add a new row and fill each column accordingly.
Below is an example of how the pipeline set-up looks like for CC:
# cross type
sample_type <- 'CC'
### Environment
config_sample <- config[config$array_type == sample_type] # get config info for this cross type
marker_type <- config_sample$marker_type # type of marker (GigaMUGA/MiniMUGA)
founders_list <- unlist(strsplit(config_sample$founders_list, ", ")) # contributing founders
n_founders <- length(founders_list) # number of founders used in reconstruction
ngen <- config_sample$ngen # ngen parameter used in cross
sample_url <- config_sample$url # URL to retrieve data from haploqa for this cross
founder_url <- config_sample$founder_url # URL to retrieve founder data for GigaMUGA/MiniMUGA
truth_model <- config_sample$truth_model # cross type of truth model for the cross
# directory to store input data retrieved from haploqa
data_dir <- file.path(root, config_sample$data_dir)
# directory to store input data for qtl2
qtl2_dir <- file.path(root, config_sample$qtl2_dir)
print(config_sample)
## array_type annot_file data_dir qtl2_dir marker_type
## 1: CC gm_uwisc_v2.csv haploqa_cc_gm cc_qtl2_genail GigaMUGA
## url
## 1: http://haploqa-dev.jax.org/tag/Collaborative%20Cross.html
## founders_list
## 1: 129S1/SvImJ, A/J, C57BL/6J, CAST/EiJ, NOD/ShiLtJ, NZO/HlLtJ, PWK/PhJ, WSB/EiJ
## ngen exclude_samples
## 1: 4 AF8, AF9, AFA, AFB, AFD
## founder_url
## 1: http://haploqa-dev.jax.org/tag/UNC_Villena_GIGMUGV01_20141012_FinalReport.html
## truth_model
## 1: risib8
From the above set up, the scripts will look for gm_uwisc_v2.csv in https://raw.githubusercontent.com/kbroman/MUGAarrays/main/UWisc/ as the annotation file. The raw input data files will be scraped from http://haploqa-dev.jax.org/tag/Collaborative%20Cross.html and these files will be stored in a folder, ‘haploqa_cc_gm’, in the same directory as the repo, for further wrangling. The wrangled data that’s ready for qtl2 read_cross function will be stored in a folder ‘cc_qtl2_genail’ also in the same directory as the repo. qtl2 will be use ngen = 4 to run haplotype reconstructions. The optimal model of CC is ‘risib8’ in qtl2.
The below results section will demonstrate how the pipelines are used for each cross and the results generated from them.
As mentioned in the introduction, in order to decide whether to transition, the accuracy of haploqa and qtl2 each need to be compared to an optimal model. The pipelines in the sources script perform haplotype reconstruction tasks using genotyping data obtained from different types of data sources, specifically, 5 sets of GigaMUGA samples (Collaborative Cross, Diversity Outbred, F3, F2, and a GigaMUGA mutant cross), 70 individuals within the MiniMUGA sample. Decision was made to not include Muga/MegaMUGA data due to the discontinued usage of the array types.
The pipeline for GigaMUGA executes all the individuals within the sample. The contributing founders for all individuals within the GigaMUGA samples are always the same.
The pipeline to run haplotype reconstructions for an entire cross can be initiated using the function ‘haplotype_reconstruction_pipeline’. Please set toggle ‘samples_gen’ = True if the genotyping data for individuals need to be (re)downloaded from HaploQA, and set to False otherwise. please set toggle ‘qtl2_file_gen’ = F if the input files for qtl2 read_cross need to be (re)generated, and False otherwise.
The function automatically saves the computed haplotype reconstruction results to a number of RDS files, which you can find in the /results/RDS/ directory of this repo. Whenever the ‘qtl2_file_gen’ toggle is set to True to regenerate qtl2 input files for a cross, the according result RDS files for that cross need to be manually removed from the /results/RDS/ directory for the haplotype reconstruction results to be properly updated, as the functions that calculate discordance metrics and construct visualizations read the results from these RDS files.
do_results <- haplotype_reconstruction_pipeline('DO', list_pheno, qtl2_file_gen = F, samples_gen = F)
Below are the genotype visualizations for some individuals in DO:
Karyotype plot - sample 6UY in DO
Karyotype plot - sample 6WN in DO
cc_results <- haplotype_reconstruction_pipeline('CC', list_pheno, qtl2_file_gen = F, samples_gen = F)
Karyotype plot - sample 35V in CC
Karyotype plot - sample 6MV in CC
bxd_results <- haplotype_reconstruction_pipeline('BXD', list_pheno, qtl2_file_gen = F, samples_gen = F)
Karyotype plot - sample 5PD in F3
Karyotype plot - sample 5R2 in F3
f2_results <- haplotype_reconstruction_pipeline('F2', list_pheno, qtl2_file_gen = F, samples_gen = F)
Karyotype plot - sample 8RS in F2
Karyotype plot - sample 8T3 in F3
The below two code chunks represent how the pipeline is used for the mutant mice crosses, where the contributing founders vary between crosses. The process involve a certain extent of filtering because some samples within the cross may only have one contributing founder, which does not work for genAIL as the algorithm requires two or more founders to perform haplotype reconstructions, and there is no simple way of identifying how many contributing fouders are present in each individual.
The function below again, saves the RDS files that contain haplotype reconstruction results for all samples in the same RDS directory as above (/results/RDS/), please make sure to remove th existing RDS file if the results need to be regenerated. The haplotype reconstruction results for each individuals can also be found in the directory where the qtl2 input files for read_cross are being stored - which should be the specified ‘qtl2_dir’ value in the config file.
gm_res_dir <- file.path(results_dir, 'RDS', 'gm_results.rds')
gm_pct_dir <- file.path(shiny_pct_fp, 'gm_pct_comp.csv')
if (!file.exists(gm_res_dir)) {
# ### GigaMUGA mutant
gigamuga_url <- 'http://haploqa-dev.jax.org/tag/Jackson_Lab_Mahaffey_MURGIGV01_20160102_FinalReport.html'
url_domain <- 'http://haploqa-dev.jax.org/'
marker_type <- 'GigaMUGA'
html_table <- read_html(gigamuga_url) %>% html_nodes("a") %>% html_attr("href")
url_list <- paste0(url_domain, html_table[grepl('/sample', html_table)])
summary_df <- sample_summary_scrape(read_html(gigamuga_url), url_list, marker_type)
summary_df$`% No Call` <- as.numeric(gsub("%", "", summary_df$`% No Call`))
summary_df <- summary_df[summary_df$`% No Call` < 10,]
ind_gm <- summary_df[summary_df$`Haplotype Candidate` == 'False',]$`ID`
ind_gm <- ind_gm[!ind_gm %in% c('RY', 'S8', 'SG', 'SR', 'Y2', 'YA', 'YJ', 'YT', 'XZ', 'Y9')] # only one founder
gigamuga_results <- list()
gm_pct_res <- list()
skipped_inds <- list()
for (ind in ind_gm) {
print(ind)
skip <- FALSE
tryCatch(sample_gm_res <- sample_haplotype_reconstruction('MURGIGV01', ind, samples_gen = F, qtl2_file_gen = F), error = function(e) {skip <<- TRUE})
if(skip) {
skipped_inds <- c(skipped_inds, ind)
next}
df <- ind_geno_comp(sample_gm_res, ind, 'MURGIGV01')
gm_pct_res[[ind]] <- df
gigamuga_results[[ind]] <- sample_gm_res
}
print(paste0('Skipped individuals:', skipped_inds))
gm_pct_res <- rbindlist(gm_pct_res)
saveRDS(gigamuga_results, gm_res_dir)
} else {
gigamuga_results <- readRDS(gm_res_dir)
}
if (!file.exists(gm_pct_dir)) {
write.csv(gm_pct_res, gm_pct_dir, row.names = F)
} else {
gm_pct_res <- fread(gm_pct_dir)
}
The selected GigaMUGA mutant cross contains some mice with two contributing founders and some genetically diverse DO mice:
Karyotype plot - sample ND in GigaMUGA Mutant
Karyotype plot - sample YE in GigaMUGA Mutant
mini_res_dir <- file.path(results_dir, 'RDS', 'MiniMUGA_results.rds')
mini_pct_dir <- file.path(shiny_pct_fp, 'mini_pct_comp.csv')
if (!file.exists(mini_res_dir)) {
### MiniMUGA pipeline
minimuga_url <- 'http://haploqa-dev.jax.org/tag/MiniMUGA.html'
url_domain <- 'http://haploqa-dev.jax.org/'
marker_type <- 'MiniMUGA'
html_table <- read_html(minimuga_url) %>% html_nodes("a") %>% html_attr("href")
url_list <- paste0(url_domain, html_table[grepl('/sample', html_table)])
summary_df <- sample_summary_scrape(read_html(minimuga_url), url_list, marker_type)
ind_mini <- summary_df[summary_df$`Haplotype Candidate` == 'False',]$`ID`
ind_mini <- ind_mini[ind_mini!='JXX', 'JY6'] # screentime error, skip this one
ind_mini <- ind_mini[ind_mini!='JY9'] # only 1 contributing strain, AIL incompatible, skip this one as well
minimuga_results <- list()
mini_pct_res <- list()
skipped_inds <- list()
for (ind in ind_mini) {
print(ind)
skip <- FALSE
tryCatch(sample_mini_res <- sample_haplotype_reconstruction('MiniMUGA', ind, samples_gen = F, qtl2_file_gen = F), error = function(e) {skip <<- TRUE})
if(skip) {
skipped_inds <- c(skipped_inds, ind)
next}
df <- ind_geno_comp(sample_mini_res, ind, 'MiniMUGA')
minimuga_results[[ind]] <- sample_mini_res
mini_pct_res[[ind]] <- df
}
print(paste0('Skipped individuals:', skipped_inds))
mini_pct_res <- rbindlist(mini_pct_res)
saveRDS(minimuga_results, mini_res_dir)
} else {
minimuga_results <- readRDS(mini_res_dir)
}
if (!file.exists(mini_pct_dir)) {
write.csv(mini_pct_res, mini_pct_dir, row.names = F)
} else {
mini_pct_res <- fread(mini_pct_dir)
}
Karyotype plot - sample JXN in MiniMUGA
Karyotype plot - sample JYA in MiniMUGA
A dashboard was built in R Shiny to visualize the karyotype plot results of the haplotype reconstructions, the script to the Shiny app is ‘app.R’ in this directory. The below code chunk should be able to launch the Shiny app.
source(paste0(root,"/app.R"))
shinyApp(ui, server)
To summarize the results, for GigaMUGA, we selected 277 samples from DO, 218 samples from CC, 71 samples from F3, 34 samples from F2, and 288 mutant samples. For MiniMUGA, we selected 25 samples. Among which, DO and CC are complex crosses with 8 contributing founders, F3 and F2 are relatively simple two-founder crosses, whereas the number of contributing founders in the mutant crosses samples can vary from 2 to 8. This is to ensure there are sufficient evidences to evaluate the accuracy of haplotype reconstruction results for crosses of all computational sizes and complexity.
There are three parts of the comparison: 1. haploqa results, which can be obtained directly from haploqa.jax.org by downloading the SNP level haplotype reports from each sample. 2. qtl2 genAIL results, which was calculated using the R package ‘qtl2’ with the crosstypes ‘genail’ 3. qtl2 best model, which was calculated using the qtl2package with the original crosstype
At a glance of the visualizations which present a high-level idea of the difference between the haplotype reconstruction results between qtl2 and haploqa, there are some visible differences between the two models - specifically, qtl2 appears to have more crossovers and come minor areas appear to have different founder results. To determine which model is more correct, some ‘truth’ models, which the algorithms have been used (for a long time?), were also constructed for each of the crosses.
As mentioned in the configuration file, the truth model for CC uses crosstype ‘risib8’, the truth model for DO uses crosstype ‘do’, the truth model for BXD F3 uses crosstype ‘ail’, and the truth model for F2 uses crosstype ‘f2’. The executions of pipelines for the truth models can be initiated using the function, ‘truth_model_reconstruction’. Same as the previous haplotype reconstruction functions, please set toggle ‘samples_gen’ = True if the genotyping data for individuals need to be (re)downloaded from HaploQA, and set to False otherwise. please set toggle ‘qtl2_file_gen’ = F if the input files for qtl2 read_cross need to be (re)generated, and False otherwise. Please also remove the existing result RDS files for the according cross (if any) whenever the qtl2 input files were regenerated to make sure the results are updated properly.
do_truth_results <- truth_model_reconstruction('DO', list_pheno, qtl2_file_gen = F, samples_gen = F)
cc_truth_results <- truth_model_reconstruction('CC', list_pheno, qtl2_file_gen = F, samples_gen = F)
bxd_truth_results <- truth_model_reconstruction('BXD', list_pheno, qtl2_file_gen = F, samples_gen = F)
f2_truth_results <- truth_model_reconstruction('F2', list_pheno, qtl2_file_gen = F, samples_gen = F)
For visualizations, we constructed boxplots that demonstrate the percentage of marker that do not have the same results between qtl2, haploQA and the truth model:
# filepath to the csvs
# DO
do_pct_df <- fread(file.path(shiny_pct_fp, 'do_truth_comp.csv'))
# CC
cc_pct_df <- fread(file.path(shiny_pct_fp, 'cc_truth_comp.csv'))
# BXD f3
bxd_pct_df <- fread(file.path(shiny_pct_fp, 'bxd_truth_comp.csv'))
# f2
f2_pct_df <- fread(file.path(shiny_pct_fp, 'f2_truth_comp.csv'))
# GigaMUGA Mutant
gm_pct_res <- fread(file.path(shiny_pct_fp, 'gm_pct_comp.csv'))
# MiniMUGA
mini_pct_res <- fread(file.path(shiny_pct_fp, 'mini_pct_comp.csv'))
# wrangle csv files for plots
do_qtl2_truth_diff <- do_pct_df$qtl2_pct_diff
cc_qtl2_truth_diff <- cc_pct_df$qtl2_pct_diff
bxd_qtl2_truth_diff <- bxd_pct_df$qtl2_pct_diff
f2_qtl2_truth_diff <- f2_pct_df$qtl2_pct_diff
# align lengths
max_n <- max(nrow(do_pct_df), nrow(cc_pct_df), nrow(bxd_pct_df), nrow(f2_pct_df))
length(do_qtl2_truth_diff) <- max_n
length(cc_qtl2_truth_diff) <- max_n
length(bxd_qtl2_truth_diff) <- max_n
length(f2_qtl2_truth_diff) <- max_n
df_qtl2_truth_diff <- as.data.frame(cbind('do' = do_qtl2_truth_diff, 'cc' = cc_qtl2_truth_diff, 'f3' = bxd_qtl2_truth_diff, 'f2' = f2_qtl2_truth_diff))
plot_df <- melt(df_qtl2_truth_diff)
## No id variables; using all as measure variables
library(ggplot2)
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('qtl2 - proportion discordance between optimal and implemented models')
## Warning: Removed 508 rows containing non-finite values (stat_boxplot).
According to the visualization, which shows the percentage of marker discordance between qtl2 genail and the truth models - DO, F3 and F2 are all producing results that are almost exactly the same as the optimal model, with CC having some discrepancies. We believe the reason why CC performs so differently is that the optimal model for CC does not allow any heterozygosity, whereas the genail does, so the genail model showed some residual heterozygosity from the CC mice which the optimal model does not have the ability to show.
do_haploqa_truth_diff <- do_pct_df$haploqa_pct_diff
cc_haploqa_truth_diff <- cc_pct_df$haploqa_pct_diff
bxd_haploqa_truth_diff <- bxd_pct_df$haploqa_pct_diff
f2_haploqa_truth_diff <- f2_pct_df$haploqa_pct_diff
# align lengths
max_n <- max(nrow(do_pct_df), nrow(cc_pct_df), nrow(bxd_pct_df), nrow(f2_pct_df))
length(do_haploqa_truth_diff) <- max_n
length(cc_haploqa_truth_diff) <- max_n
length(bxd_haploqa_truth_diff) <- max_n
length(f2_haploqa_truth_diff) <- max_n
df_haploqa_truth_diff <- as.data.frame(cbind('do' = do_haploqa_truth_diff, 'cc' = cc_haploqa_truth_diff, 'f3' = bxd_haploqa_truth_diff, 'f2' = f2_haploqa_truth_diff))
plot_df <- melt(df_haploqa_truth_diff)
## No id variables; using all as measure variables
library(ggplot2)
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('haploqa - proportion discordance between optimal and implemented models')
## Warning: Removed 508 rows containing non-finite values (stat_boxplot).
According to the visualization, which shows the percentage of marker differences between haploqa and the truth models - the haploqa results for f2 and f3 still seem to be similar to the optimal model, but DO and CC seem to be relatively incorrect results.
do_haploqa_truth_diff <- do_pct_df$haploqa_qtl2_pct_diff
cc_haploqa_truth_diff <- cc_pct_df$haploqa_qtl2_pct_diff
bxd_haploqa_truth_diff <- bxd_pct_df$haploqa_qtl2_pct_diff
f2_haploqa_truth_diff <- f2_pct_df$haploqa_qtl2_pct_diff
# align lengths
max_n <- max(nrow(do_pct_df), nrow(cc_pct_df), nrow(bxd_pct_df), nrow(f2_pct_df))
length(do_haploqa_truth_diff) <- max_n
length(cc_haploqa_truth_diff) <- max_n
length(bxd_haploqa_truth_diff) <- max_n
length(f2_haploqa_truth_diff) <- max_n
df_haploqa_truth_diff <- as.data.frame(cbind('do' = do_haploqa_truth_diff, 'cc' = cc_haploqa_truth_diff, 'f3' = bxd_haploqa_truth_diff, 'f2' = f2_haploqa_truth_diff))
plot_df <- melt(df_haploqa_truth_diff)
## No id variables; using all as measure variables
library(ggplot2)
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('haploqa vs qtl2 - proportion discordance between implemented models')
## Warning: Removed 508 rows containing non-finite values (stat_boxplot).
For the above three plots, there are two significant outliers at 80% discordance showing for CC, which are samples V6 and VP. Their haplotype reconstruction results appear to be highly heterozygous as CC mice, which explains the high discordance rate. However, it is believed that these two mice samples may have been mistakenly labeled as CC, as the genomes of CC mice are generally homozygous. Further investigation may be needed on the two samples.
Below are the karyotype plots for the two samples:
The two mutant mice crosses, GigaMUGA Mutant and MiniMUGA, do not have a tailored optimal model. Therefore, plotting the percentage of marker discordance between qtl2 genail and haploQA.
MiniMUGA:
mini_plot <- mini_pct_res %>% group_by(sample_id) %>% summarise(qtl2_pct_diff = sum(haplo_diplotype != qtl2_calls)/n()) %>% as.data.frame()
plot_df <- melt(mini_plot)
## Using sample_id as id variables
plot_df$variable <- 'MiniMUGA'
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('MiniMUGA haploqa vs genail - % of marker discordance') + xlab('model')
The MiniMUGA samples are mutant mice samples, meaning they often have founders that contribute to only a few markers. The above boxplot shows a relatively significant high number of markers that are different between haploQA and qtl2 genail, which shows a weakness for qtl2: due to the way HMM models determines genome states based on probabilities - when they are extra founders, or founders that only contribute to a few markers, qtl2 genail tends to be ‘confused’ and bounce states in the haplotype reconstruction process, whereas haploQA is capable of staying in states.
The outlier at approximately 70% is sample JXV, which according to its karyotype on HaploQA, this sample has a significant mount of genetic variations for a mutant mice. This could also affect the way qtl2 decides the geno phases and cause the discordance
Karyotype plot - sample JXV in MiniMUGA
GigaMUGA mutant:
gm_plot <- gm_pct_res %>% group_by(sample_id) %>% summarise(qtl2_pct_diff = sum(haplo_diplotype != qtl2_calls)/n()) %>% as.data.frame()
plot_df <- melt(gm_plot)
## Using sample_id as id variables
plot_df$variable <- 'GigaMUGA Mutant'
ggplot(plot_df) + geom_boxplot(aes(x = variable, y = value)) + ggtitle('GigaMUGA Mutant haploqa vs genail - % of marker discordance') + xlab('model')
The results between genail and haploQA appear to be highly similar, with an average of less than 2% marker discordance between the models. The outlier at approximately 50% is the sample, ‘PX’. The karyotype plot is show as below:
Karyotype plot - sample PX in GigaMUGA Mutant
Overall, the existing results for qtl2 genail and haploqa show that the two methods have very similar performance for most of the crosses, with genail performing significantly better for the DO outbred mice samples. HaploQA handles the extra founders being present in the haplotype reconstruction process better, however, qtl2 genail model has stronger capabilities to produce more accurate results for complicated crosses. So based on the results we have so far, we’re leaning toward that genail is more accurate than haploqa.
Aside from the statistical metrics comparison between the haplotype reconstruction results, it is also worth taking the programmatic strengths and weaknesses of the two models into consideration:
HaploQA:
advantages:
Not a programming tool - people without relevant knowledge can still use the tool and perform haplotype reconstruction
Works for all MUGA array-based genotyping data
Detailed and concise visualizations on web interface
disadvantages:
haplotype reports are not easy to access computationally
source codes are also not easy to access computationally
expensive to maintain at the organization’s costs
R/qtl2:
advantages:
fully accessible computationally
capable of performing other genetic analysis
currently maintained by developers
disadvantages:
algorithms is probability based, tends to bounce states when low-contribution founders are present
visualizations are less detailed and need to be generated on demand
does not scale well with large number of founders, run time increases exponentially
Genoprob runtime plot
Broman KW, Gatti DM, Simecek P, Furlotte NA, Prins P, Sen Ś, Yandell BS, Churchill GA. R/qtl2: Software for Mapping Quantitative Trait Loci with High-Dimensional Data and Multiparent Populations. Genetics. 2019 Feb;211(2):495-502. doi: 10.1534/genetics.118.301595. Epub 2018 Dec 27. PMID: 30591514; PMCID: PMC6366910.
Known issues:
The webpages throws a screentime error while downloding some individuals from haploQA - this is presumably a Mac issue. The functions automatically skip the samples with screentime error issue, however, this may become an issue when a screentime error rises up while getting the founder data because the function would skip that founder, and the user needs to manually download it. Please make sure that all founders are present before proceeding with the pipelines.
Error in dev.off() : QuartzBitmap_Output - unable to open file ’/Users/linc/.local/share/rstudio/notebooks/AFB3D6E3-draft_report/1/8779F85C5B04ECDA/cvdr4ccbjgppi_t/_rs_chunk_plot_010.png’
This is likely an R markdown issue or some unclosed connections in the backend environment after knitting the R markdown - please make sure to clear the workspace or restart R if this happens.